Introduction

The relationship between house prices and neighborhood characteristics has long been recognized. For example, public infrastructure investment, highly qualified residents, and low crime rates all benefit adjacent properties. Interrelated urban housing submarkets may together form a unitary urban housing market. Three factors are commonly used to describe these submarkets: housing structural attributes, spatial attributes (the housing location), and demander groups. The joint influence of structural and spatial attributes can be treated as a fourth factor.

However, predicting home prices is challenging, because the factors that affect a home's price differ from one area to another. In this project, we modeled housing prices and related factors in the San Francisco area. The model we developed predicts the price of homes in San Francisco based on local conditions and characteristics.

We collected data on the internal characteristics of San Francisco homes and on the spatial structure of the surrounding streets. We also obtained information on the local population from census data, along with data on crime. Based on these data, we completed an initial version of our San Francisco house price prediction model.

Data

Methods of Data Collection

We collected our data mainly from two sources: DataSF and Social Explorer. From the former, we collected mostly spatial data, such as the locations of crimes, trees, schools, restaurants, and parking lots, as well as neighborhood boundaries. From the latter, we collected census data for the research boundaries we chose and spatially joined them to those boundaries. We think the characteristics of the people who live in a neighborhood have a strong potential impact on house prices. Finally, we considered spatial lags and neighborhood fixed effects. However, these two variables did not bring much improvement to our model.

Description of Predictors

The table and plot below present summary statistics and distributions of values for our outcome variable and the 12 predictor variables in our model.

| Variable | Type | Category | Description | Min | Median | Max |
|---|---|---|---|---|---|---|
| Asian | Continuous | Demographic characteristic | Percentage of Asian residents | 0 | 0.1 | 0.3 |
| Bachelor | Continuous | Demographic characteristic | Percentage of residents with a bachelor's degree | 0 | 0.2 | 0.3 |
| Master | Continuous | Demographic characteristic | Percentage of residents with a master's degree | 0 | 0 | 0.1 |
| gas_used | Continuous | Demographic characteristic | Percentage of households using gas | 0.1 | 0.7 | 0.8 |
| electricit | Continuous | Demographic characteristic | Percentage of households using electricity | 0 | 0.1 | 0.2 |
| Median_GrI | Continuous | Demographic characteristic | Median gross income | 216 | 961 | 1097 |
| Average_Co | Continuous | Demographic characteristic | Average gross income | 19 | 29 | 31 |
| pop | Continuous | Demographic characteristic | Population of each neighborhood | 231 | 3906 | 5140 |
| pop_den | Continuous | Demographic characteristic | Population density of each neighborhood | 318.8 | 15087.2 | 20130.8 |
| medinc | Continuous | Demographic characteristic | Median household income | 22265 | 86601 | 95601 |
| avg_hsinc | Continuous | Demographic characteristic | Average household income | 24134.7 | 69320.6 | 77834.8 |
| agg_hsinc | Continuous | Demographic characteristic | Aggregate household income | 2921900 | 112877700 | 170305200 |
| inc_perca | Continuous | Demographic characteristic | Per capita income | 7428 | 21342 | 30367 |
| bycar | Continuous | Demographic characteristic | Percentage of people who commute by car | 0.1 | 0.6 | 0.6 |
| byfoot | Continuous | Demographic characteristic | Percentage of people who commute on foot | 0 | 0 | 0 |
| mhh_child | Continuous | Demographic characteristic | Number of children | 0 | 44 | 62 |
| med_age | Continuous | Demographic characteristic | Median age of neighborhood | 26 | 36 | 39 |
| LotArea | Continuous | Internal characteristic | Area of lot (sqft) | 0 | 187500 | 250000 |
| PropArea | Continuous | Internal characteristic | Area of home (sqft) | 0 | 1150 | 1487 |
| Stories | Continuous | Internal characteristic | Number of stories | 0 | 1 | 1 |
| Rooms | Continuous | Internal characteristic | Number of rooms | 0 | 5 | 6 |
| Beds | Continuous | Internal characteristic | Number of beds | 0 | 0 | 2 |
| Baths | Continuous | Internal characteristic | Number of baths | 0 | 1 | 2 |
| SalePrice | Continuous | Outcome variable | Sale price | 100001 | 695001 | 930003 |
| lagPrice15 | Continuous | Spatial characteristic | Avg price of 15 nearest home sales | 236868.6 | 688622.2 | 969502.1 |
| crime.Buffer | Continuous | Spatial characteristic | Number of crimes within 1/8 mile | 3 | 82 | 109 |
| crime_nn5 | Continuous | Spatial characteristic | Avg distance to 5 nearest crimes | 30.4 | 76.1 | 93.3 |
| rest.Buffer | Continuous | Spatial characteristic | Number of restaurants within 1/8 mile | 4 | 38 | 62 |
| schl_nn5 | Continuous | Spatial characteristic | Avg distance to 5 nearest schools | 87.4 | 374.6 | 472 |
| tree_nn5 | Continuous | Spatial characteristic | Avg distance to 5 nearest trees | 3.2 | 21.3 | 27 |
| bus_nn5 | Continuous | Spatial characteristic | Avg distance to 5 nearest buses | 39.4 | 124.7 | 163.2 |
| parking_nn5 | Continuous | Spatial characteristic | Avg distance to 5 nearest parking lots | 14.1 | 224.6 | 415.9 |
| nbor | Categorical | Spatial characteristic | Neighborhood name | | | |
| SaleYr | Categorical | Temporal | Year the house was sold | | | |
| BuiltYear | Categorical | Temporal | Year the house was built | | | |
| BuiltYear | Continuous | NA | NA | 0 | 1913 | 1929 |
| SaleYr | Continuous | NA | NA | 12 | 12 | 13 |
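The nearest-neighbor (`_nn5`) and buffer (`.Buffer`) predictors in the table can be illustrated with a minimal sketch. It assumes that homes and amenities are points in a projected coordinate system measured in feet, so that 1/8 mile is 660 feet. The function names and sample coordinates are hypothetical and are not taken from the project code.

```python
from math import hypot

def avg_knn_distance(home, points, k=5):
    """Average distance from a home to its k nearest points (e.g. crimes)."""
    distances = sorted(hypot(px - home[0], py - home[1]) for px, py in points)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)

def count_within(home, points, radius):
    """Number of points that fall within a circular buffer around a home."""
    return sum(
        1 for px, py in points
        if hypot(px - home[0], py - home[1]) <= radius
    )

# Hypothetical home and crime locations (feet, projected coordinates).
home = (0.0, 0.0)
crimes = [(3.0, 4.0), (6.0, 8.0), (0.0, 10.0), (30.0, 40.0)]

nn2 = avg_knn_distance(home, crimes, k=2)          # distances 5 and 10 -> 7.5
in_buffer = count_within(home, crimes, radius=10)  # three points within 10 ft
eighth_mile = count_within(home, crimes, radius=660)
print(nn2, in_buffer, eighth_mile)
```

A smaller average distance means the amenity (or crime) is closer to the home, while a larger buffer count means more of it nearby, so the two kinds of features move in opposite directions.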

Correlation Matrix

We used a correlation plot to identify pairs of variables with high correlation coefficients (> 0.9). From these pairs, we removed some variables based on their p-values in the summary of our first model. This minimizes collinearity among the variables in our model, as collinear variables can rob each other of predictive power.
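As a minimal sketch of this screening step, the code below computes Pearson correlations for every pair of candidate predictors and flags pairs above the 0.9 threshold. The column values are made-up illustrative numbers, and the function names are hypothetical. Deciding which variable of a flagged pair to drop (by p-value, as described above) is left to the analyst.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def high_correlation_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged

# Illustrative values only; not the project's data.
columns = {
    "medinc": [22265, 60000, 86601, 95601, 70000],
    "avg_hsinc": [24134, 52000, 69320, 77834, 60000],
    "tree_nn5": [27.0, 3.2, 21.3, 10.0, 15.0],
}

flagged = high_correlation_pairs(columns)
for name_a, name_b, r in flagged:
    print(f"{name_a} ~ {name_b}: r = {r:.3f}")
```

In this toy data only the two income measures are flagged, which mirrors the situation where several income variables carry nearly the same information.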

Home Price Correlation Scatterplots

Average price of the 15 nearest homes (lagPrice15), average distance to the 5 nearest trees (tree_nn5), median gross income (Median_GrI), and income per capita (inc_perca) were four of our most important variables. Plotting them against sale price shows a positive correlation for each of them except the average distance to the 5 nearest trees (tree_nn5).

Map of Dependent variables (Sale Prices)

Map of Dependent variables (plots)

Map of Dependent variables (spatial map)

Methods

We created a multiple linear regression model to predict house prices. We first selected predictors that we thought might have an impact on house prices, then used correlation coefficients to remove some of them and prevent collinearity. Next, we applied simple feature engineering to variables such as the number of bathrooms and the number of bedrooms. We recoded them into several categories, because treating these variables as continuous might not be the best way to improve our model's accuracy.

We then tested for a spatial lag, namely: do model errors exhibit spatial autocorrelation? To do so, we calculated the average price of the 15 nearest houses for each house. The result indicates that spatial autocorrelation does exist. The observed Moran's I of 0.1342575 seems marginal, but the p-value of 0.001 suggests that model errors exhibit greater spatial autocorrelation than would be expected from random chance alone.

Additionally, prices and errors seem to vary across neighborhoods. We hypothesized that a "neighborhood effect" could help predict variation in price, so we included San Francisco neighborhoods in our model. However, this did little to improve the accuracy of our model.
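The spatial lag described above (the average price of each home's nearest neighbors) can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: coordinates are assumed to be projected, a brute-force search is used instead of a spatial index, and the function names and sample values are hypothetical.

```python
from math import hypot

def knn_indices(coords, i, k):
    """Indices of the k points nearest to point i, excluding i itself."""
    xi, yi = coords[i]
    ranked = sorted(
        (hypot(x - xi, y - yi), j)
        for j, (x, y) in enumerate(coords)
        if j != i
    )
    return [j for _, j in ranked[:k]]

def spatial_lag(coords, values, k=15):
    """Average value of the k nearest neighbors of each point."""
    lags = []
    for i in range(len(coords)):
        neighbors = knn_indices(coords, i, k)
        lags.append(sum(values[j] for j in neighbors) / len(neighbors))
    return lags

# Toy example with k=2 so the result is easy to check by hand.
coords = [(0, 0), (1, 0), (2, 0), (10, 0)]
prices = [100, 200, 300, 1000]
lags = spatial_lag(coords, prices, k=2)
print(lags)  # [250.0, 200.0, 150.0, 250.0]
```

The same function applied to model residuals instead of prices gives the lagged error used to check whether errors cluster in space.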

Results

Test Set Prediction

The table below shows the MAE, MAPE, and R-squared values of our model.

On average, our predictions were off by about $250,000, or about 25.4%. Our model accounted for 71.5% of the variation in sale price in the test set.

Error Values and R-Squared for Test Set Predictions
| MAE | MAPE | R-Squared |
|---|---|---|
| 251753 | 25.41 % | 0.7148072 |
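For reference, the three statistics in this table can be computed as in the sketch below. It assumes the standard definitions: MAE is the mean absolute error, MAPE is the mean absolute error as a fraction of the observed price, and R-squared is one minus the ratio of residual to total sum of squares. The report does not state exactly how its R-squared was obtained, so that definition is an assumption, and the sample prices are made up.

```python
def mae(actual, predicted):
    """Mean absolute error, in the same units as the prices."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.25 = 25%)."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Share of variance in the observed prices explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Illustrative values only.
actual = [100, 200, 300, 400]
predicted = [110, 190, 330, 380]

print(mae(actual, predicted))                  # 17.5
print(round(mape(actual, predicted), 4))       # 0.075
print(round(r_squared(actual, predicted), 4))  # 0.97
```

MAE is dominated by expensive homes, while MAPE weights every home equally in relative terms, which is why the report quotes both.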

Cross-Validation

The cross-validation results show evaluation statistics that compare our predicted house prices with the observed sale prices. The multiple R-squared is 71.48%, and the adjusted R-squared is 71.22%.
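The idea behind k-fold cross-validation can be sketched with a deliberately small example: the data are split into k folds, the model is fitted on k-1 of them, and the error is measured on the held-out fold. The sketch assumes a single-predictor linear model and a simple round-robin fold assignment. The report's model has many predictors and its fold scheme is not described, so everything here is illustrative and the names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return slope, mean_y - slope * mean_x

def k_fold_mae(xs, ys, k=5):
    """Mean absolute error on each held-out fold."""
    n = len(xs)
    fold_maes = []
    for fold in range(k):
        # Round-robin assignment: observation i belongs to fold i % k.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        slope, intercept = fit_line(
            [xs[i] for i in train_idx], [ys[i] for i in train_idx]
        )
        errors = [abs(ys[i] - (slope * xs[i] + intercept)) for i in test_idx]
        fold_maes.append(sum(errors) / len(errors))
    return fold_maes

# Made-up data: price is an exact linear function of home area,
# so the held-out error should be essentially zero in every fold.
areas = [float(a) for a in range(500, 1500, 100)]
prices = [300.0 * a + 50000.0 for a in areas]
fold_maes = k_fold_mae(areas, prices, k=5)
print([round(m, 6) for m in fold_maes])
```

With real data, a large spread in error across folds would suggest that the model does not generalize evenly to unseen homes.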

Plot of Prediction results

We start with a histogram of the error values. According to the results of the model, the errors are not evenly distributed: for some houses, the error is very large. Nevertheless, most of our model's errors are relatively small.

We also created a density histogram, which has the advantage of showing the distribution of the data more clearly. The figure shows that our predictions still deviate from the real house prices, especially near the average. However, the distribution of our predictions is close to a normal distribution. The orange line indicates a perfect fit between predicted and observed prices, and the green line shows the average predicted fit of the model.

Spatial Lag

There is spatial correlation in the housing price data.

As the sale price error of a home increases, the price error of nearby homes also increases.

As the sale price of a home rises, the prices of nearby homes also rise.

Moran's I

We calculated a Moran's I of 0.12634, which indicates slight clustering of residuals in our test set. The p-value is 0.001, indicating that the observed clustering is greater than what would be expected from random chance alone.
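A minimal sketch of how a Moran's I and its permutation p-value can be obtained is shown below. It assumes binary k-nearest-neighbor weights and a one-sided test based on 999 random shuffles of the residuals. The report does not state which weights or how many permutations were used, so these are assumptions, and the function names and toy data are hypothetical. Under this scheme the smallest attainable p-value is 1/1000 = 0.001.

```python
import random
from math import hypot

def knn_neighbors(coords, k):
    """For each point, the indices of its k nearest other points."""
    neighbors = []
    for i, (xi, yi) in enumerate(coords):
        ranked = sorted(
            (hypot(x - xi, y - yi), j)
            for j, (x, y) in enumerate(coords)
            if j != i
        )
        neighbors.append([j for _, j in ranked[:k]])
    return neighbors

def morans_i(values, neighbors):
    """Moran's I with binary weights (1 for each listed neighbor)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(len(nb) for nb in neighbors)
    cross = sum(dev[i] * dev[j] for i, nb in enumerate(neighbors) for j in nb)
    return (n / total_weight) * cross / sum(d * d for d in dev)

def permutation_p_value(values, neighbors, n_perm=999, seed=0):
    """One-sided pseudo p-value: share of shuffles at least as clustered."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least += 1
    return observed, (at_least + 1) / (n_perm + 1)

# Toy residuals on a line: low errors on one side, high on the other.
coords = [(float(i), 0.0) for i in range(20)]
residuals = [0.0] * 10 + [10.0] * 10
neighbors = knn_neighbors(coords, k=2)
observed, p_value = permutation_p_value(residuals, neighbors)
print(round(observed, 3), p_value)
```

The toy residuals are strongly clustered, so their Moran's I (0.9) is far higher than the values near 0.13 reported here. A value near 0 with a small p-value, as in the report, means the clustering is weak but unlikely to be due to chance.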

| nbor | meanPrice | meanPrediction |
|---|---|---|
| Bayview Hunters Point | 524043.7 | 524043.7 |
| Bernal Heights | 1092916.9 | 1092916.9 |
| Castro/Upper Market | 1709029.8 | 1709029.8 |
| Chinatown | 1240001.9 | 1240001.9 |
| Excelsior | 656803.3 | 656803.3 |
| Financial District/South Beach | 1295820.0 | 1295820.0 |
| Glen Park | 1380719.7 | 1380719.7 |
| Haight Ashbury | 1666201.1 | 1666201.1 |
| Hayes Valley | 1269453.6 | 1269453.6 |
| Inner Richmond | 1576612.6 | 1576612.6 |
| Inner Sunset | 1265486.6 | 1265486.6 |
| Japantown | 584501.0 | 584501.0 |
| Lakeshore | 898960.8 | 898960.8 |
| Lincoln Park | 1000002.0 | 1000002.0 |
| Lone Mountain/USF | 1316757.0 | 1316757.0 |
| Marina | 2291989.6 | 2291989.6 |
| McLaren Park | 472552.2 | 472552.3 |
| Mission | 1174369.8 | 1174369.8 |
| Mission Bay | 932948.0 | 932948.0 |
| Nob Hill | 1522764.9 | 1522764.9 |
| Noe Valley | 1757886.1 | 1757886.1 |
| North Beach | 1417865.4 | 1417865.4 |
| Oceanview/Merced/Ingleside | 688476.3 | 688476.3 |
| Outer Mission | 732060.5 | 732060.5 |
| Outer Richmond | 1158357.1 | 1158357.1 |
| Pacific Heights | 2279064.8 | 2279064.8 |
| Portola | 673652.9 | 673652.9 |
| Potrero Hill | 1252597.4 | 1252597.4 |
| Presidio Heights | 2209032.7 | 2209032.7 |
| Russian Hill | 1906431.7 | 1906431.7 |
| Seacliff | 2557626.0 | 2557626.0 |
| South of Market | 885257.9 | 885257.9 |
| Sunset/Parkside | 888651.5 | 888651.5 |
| Tenderloin | 3400002.5 | 3400002.5 |
| Twin Peaks | 1300414.3 | 1300414.3 |
| Visitacion Valley | 565451.3 | 565451.3 |
| West of Twin Peaks | 1261246.5 | 1261246.5 |
| Western Addition | 1129470.7 | 1129470.7 |

Map of Test Set Residuals

MAPE

Scatterplot of Average Neighborhood MAPE

Generalizability across Race and Income

Errors for test set sale price predictions by neighborhood racial and income contexts
| Context | MAE | MAPE |
|---|---|---|
| High income | 311225.1 | 0.2504901 |
| Low income | 190969.2 | 0.2613050 |
| High poverty rate | 229464.3 | 0.2754896 |
| Low poverty rate | 271319.8 | 0.2374056 |
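A table like this can be produced by grouping test-set predictions by neighborhood context and computing MAE and MAPE within each group. The sketch below assumes each record carries a context label, an observed price, and a predicted price. The records and the function name are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict

def errors_by_context(records):
    """MAE and MAPE (as a fraction) of predictions within each context."""
    grouped = defaultdict(list)
    for context, actual, predicted in records:
        grouped[context].append((actual, predicted))

    summary = {}
    for context, pairs in grouped.items():
        abs_errors = [abs(a - p) for a, p in pairs]
        pct_errors = [abs(a - p) / a for a, p in pairs]
        summary[context] = {
            "MAE": sum(abs_errors) / len(pairs),
            "MAPE": sum(pct_errors) / len(pairs),
        }
    return summary

# Made-up records: (context, observed price, predicted price).
records = [
    ("High income", 1_000_000, 900_000),
    ("High income", 2_000_000, 2_200_000),
    ("Low income", 500_000, 450_000),
    ("Low income", 400_000, 440_000),
]

summary = errors_by_context(records)
for context, stats in summary.items():
    print(context, round(stats["MAE"]), round(stats["MAPE"], 3))
```

In this toy data both groups have the same MAPE but very different MAE, because homes in the high-income group are more expensive. The same pattern appears in the table above, which is why MAPE is the fairer measure for comparing contexts.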

Discussion

Although our model is more accurate in some cases, it is not suitable for use. Its accuracy is not sufficient to predict real prices across the whole San Francisco area, so it cannot be considered a valid model. In the process of creating the model, we found that it became more accurate as we added data, so in the future we hope to collect more. For example, we want to collect data relating highways to housing prices. Most houses with higher prices are clustered together, but this is not obvious in our model. This is probably because our Moran's I is 0.14; this value is close to 0, which indicates that our model accounts for most of the spatial variation in price.

Conclusion

We think our model is not suitable for predicting housing prices in San Francisco, and we would not recommend it to Zillow. In a different model, we should use a spatial lag to quantify spatial autocorrelation instead of relying on the residuals. Log-transforming the data in the OLS model would also make the modeling more effective. In a future model, we need to add more relevant, high-quality variables, which can bring our predictions closer to the true values.

One thing that our model lacks is reliable information about the characteristics of the house being analyzed, such as information about the seller and the buyer. This information could help us build better models. Therefore, more information about housing is needed to better inform our models.